19 research outputs found

    Single channel speech separation with a frame-based pitch range estimation method in modulation frequency

    Computational Auditory Scene Analysis (CASA) has attracted considerable interest for segregating speech from monaural mixtures. In this paper, we propose a new method for single-channel speech separation with frame-based pitch range estimation in the modulation frequency domain. The range is estimated in each frame of the modulation spectrum of speech by analyzing onsets and offsets. In the proposed method, the target speaker is separated from the interfering speaker by filtering the mixture signal with a mask extracted from the modulation spectrogram of the mixture. Systematic evaluation shows an acceptable level of separation compared with classic methods.
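
    The core idea of masking in the modulation frequency domain can be sketched as below. This is a minimal illustration, not the paper's method: the pitch-range bounds are assumed inputs rather than estimated from onsets/offsets, and the frame parameters are arbitrary defaults.

```python
import numpy as np

def stft_mag(x, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq_bins, n_frames)

def modulation_mask(spec, mod_lo, mod_hi, frame_rate):
    """Keep only modulation frequencies inside [mod_lo, mod_hi] Hz.

    The modulation spectrum is the FFT of each subband envelope across
    frames; zeroing bins outside the assumed pitch range and inverting
    gives the filtered subband envelopes.
    """
    mod = np.fft.rfft(spec, axis=1)                        # per-subband modulation spectrum
    mod_freqs = np.fft.rfftfreq(spec.shape[1], d=1.0 / frame_rate)
    keep = (mod_freqs >= mod_lo) & (mod_freqs <= mod_hi)
    mod_masked = np.where(keep[None, :], mod, 0.0)
    return np.fft.irfft(mod_masked, n=spec.shape[1], axis=1)
```

    A subband envelope modulated at 10 Hz survives a [5, 20] Hz mask while a 40 Hz modulation is suppressed, which is the mechanism the separation mask exploits.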

    Social Focus of Attention as a Time Function Derived from Multimodal Signals

    In this paper, we present the results of a study on the social focus of attention as a time function derived from multisource multimodal signals recorded by different personal capturing devices during social events. The core of the approach is based on fission and fusion of multichannel audio, video and social modalities to derive the social focus of attention. The results achieved to date on more than 16 hours of real-life data demonstrate the feasibility of the approach.

    Determination of Pitch Range Based on Onset and Offset Analysis in Modulation Frequency Domain

    An auditory scene in a natural environment contains multiple sources. Auditory scene analysis (ASA) is the process by which the auditory system segregates a scene into streams corresponding to different sources. Determining the range of pitch frequency is necessary for segmentation. We propose a system that determines the range of pitch frequency by analyzing onsets and offsets in the modulation frequency domain. In the proposed system, the modulation spectrum of speech is first calculated; then, onsets and offsets are detected in each subband. Thereafter, segments are generated by matching corresponding onset and offset fronts. Finally, by choosing the desired segments, the range of pitch frequency is determined. Systematic evaluation shows that the range of pitch frequency is estimated with good accuracy.
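
    Onset/offset analysis of this kind typically smooths each subband envelope, takes its time derivative, and treats derivative peaks and troughs as onset and offset candidates to be paired into segments. A toy sketch, with illustrative smoothing width and threshold that are not values from the paper:

```python
import numpy as np

def onset_offset_segments(env, smooth=5, thresh=0.05):
    """Segment an amplitude envelope by onset/offset analysis.

    Onsets are local maxima of the smoothed derivative above `thresh`;
    offsets are local minima below `-thresh`; each onset is paired with
    the next offset to form a segment.
    """
    kernel = np.exp(-0.5 * (np.arange(-3 * smooth, 3 * smooth + 1) / smooth) ** 2)
    kernel /= kernel.sum()                       # Gaussian smoothing kernel
    d = np.diff(np.convolve(env, kernel, mode='same'))
    onsets = [i for i in range(1, len(d) - 1)
              if d[i] > thresh and d[i] > d[i - 1] and d[i] >= d[i + 1]]
    offsets = [i for i in range(1, len(d) - 1)
               if d[i] < -thresh and d[i] < d[i - 1] and d[i] <= d[i + 1]]
    segments, j = [], 0
    for on in onsets:
        while j < len(offsets) and offsets[j] <= on:
            j += 1                               # skip offsets before this onset
        if j < len(offsets):
            segments.append((on, offsets[j]))
    return segments
```

    A rectangular burst in the envelope yields one segment whose endpoints fall near the rising and falling edges.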

    A Compressive Sensing Based Compressed Neural Network for Sound Source Localization

    Microphone arrays are employed today to locate sound sources in numerous real-time applications such as speech processing in large rooms or acoustic echo cancellation. Signal sources may exist in the near field or far field with respect to the microphones. Current Neural Network (NN) based source localization approaches assume far-field narrowband sources. One important limitation of these NN-based approaches is balancing computational complexity against the size of the network: an architecture that is too large or too small will hurt performance in terms of generalization and computational cost. In previous work, saliency analysis has been employed to determine the most suitable structure; however, it is time-consuming and its performance is not robust. In this paper, a family of new algorithms for compressing NNs is presented based on Compressive Sampling (CS) theory. The proposed framework makes it possible to find a sparse structure for an NN, and the designed network is then compressed using CS. The key difference between our algorithm and state-of-the-art techniques is that the mapping is continuously performed using the most effective features; therefore, the proposed method converges quickly. The empirical work demonstrates that the proposed algorithm is an effective alternative to traditional methods in terms of accuracy and computational complexity.
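
    The CS premise is that a sparse weight vector can be stored as a short random projection and recovered exactly at load time. The following toy sketch (not the paper's algorithm; sizes, sparsity, and the use of Orthogonal Matching Pursuit as the recovery solver are all illustrative assumptions) compresses a pruned weight vector 4x and reconstructs it:

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: recover a k-sparse x from y = Phi @ x."""
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(Phi.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x = np.zeros(Phi.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
n, k = 256, 8                         # weight-vector length, sparsity after pruning
w = np.zeros(n)
w[rng.choice(n, k, replace=False)] = rng.normal(size=k)   # stand-in for pruned layer weights
m = 64                                # compressed size: 4x reduction
Phi = rng.normal(size=(m, n)) / np.sqrt(m)                # random measurement matrix
y = Phi @ w                           # short vector actually stored
w_hat = omp(Phi, y, k)                # decompression at load time
```

    With m well above the sparsity level, recovery is exact up to numerical precision, which is what makes storing only `y` and the seed of `Phi` attractive.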

    Speech Enhancement using an Improved MMSE Estimator with Laplacian Prior

    In this paper we present an optimal estimator of the magnitude spectrum for speech enhancement when the clean speech DFT coefficients are modeled by a Laplacian distribution and the noise DFT coefficients by a Gaussian distribution. Chen has already introduced a Minimum Mean Square Error (MMSE) estimator of the magnitude spectrum; however, that estimator, LapMMSE, does not have a closed form and is computationally expensive. We use his formulation of the MMSE estimator, apply some approximations, and propose a computationally efficient estimator of the magnitude spectrum. Experimental studies demonstrate the better performance of our proposed estimator, Improved LapMMSE (ImpLapMMSE), compared to LapMMSE and previous estimators based on Laplacian and Gaussian assumptions.
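
    For context, the Gaussian-prior baseline that such estimators improve on applies a per-bin spectral gain driven by a decision-directed a priori SNR. The sketch below is the classic Wiener-gain variant of that baseline, not the LapMMSE or ImpLapMMSE estimator; the smoothing factor and gain floor are common textbook defaults.

```python
import numpy as np

def wiener_gain(noisy_mag, noise_psd, alpha=0.98, gmin=0.1):
    """Per-frame Wiener spectral gain with decision-directed a priori SNR.

    noisy_mag: (bins, frames) magnitude spectrogram of the noisy speech.
    noise_psd: (bins,) estimated noise power per frequency bin.
    """
    xi_prev = np.ones_like(noise_psd)
    gains = np.empty_like(noisy_mag)
    for t in range(noisy_mag.shape[1]):
        gamma = noisy_mag[:, t] ** 2 / noise_psd           # a posteriori SNR
        xi = alpha * xi_prev + (1 - alpha) * np.maximum(gamma - 1, 0)
        g = np.maximum(xi / (1 + xi), gmin)                # Wiener gain with floor
        gains[:, t] = g
        xi_prev = g ** 2 * gamma                           # decision-directed update
    return gains
```

    Bins dominated by speech converge to a gain near 1, while noise-only bins decay to the floor, which is the attenuation behaviour the Laplacian-prior estimators refine.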

    Speech Enhancement using Beta-order MMSE Spectral Amplitude Estimator with Laplacian Prior

    This report addresses the problem of speech enhancement using the Minimum Mean-Square Error (MMSE) estimator of the β-order Short-Time Spectral Amplitude (STSA). We present an analytical solution for the β-order MMSE estimator where the Discrete Fourier Transform (DFT) coefficients of clean speech are modeled by Laplacian distributions. Using some approximations for the joint probability density function and the Bessel function, we also present a closed-form version of the estimator (called β-order LapMMSE). The performance of the proposed estimator is compared to state-of-the-art spectral amplitude estimators that assume Gaussian priors for the clean DFT coefficients. Comparative results demonstrate the superiority of the proposed estimator in terms of speech enhancement and noise reduction measures.

    An Integrated Framework for Multi-Channel Multi-Source Localization and Voice Activity Detection

    Two of the major challenges in microphone array based adaptive beamforming, speech enhancement and distant speech recognition are robust and accurate source localization and voice activity detection. This paper introduces a spatial gradient steered response power with phase transform (SRP-PHAT) method that is capable of localizing competing speakers in overlapping conditions. We further investigate the behaviour of the SRP function and theoretically characterize a fixed point in its search space for the diffuse noise field, which we call the null position. Building on this evidence, we propose a technique for multi-channel voice activity detection (MVAD) based on detecting a power maximum at the null position. The gradient SRP-PHAT and the MVAD together form an integrated framework for multi-source localization and voice activity detection. Experiments carried out on real data recordings show that this framework is very effective in practical applications of hands-free communication.
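
    The building block underneath SRP-PHAT is the phase-transform-weighted cross-correlation between a microphone pair, whose peak gives the time difference of arrival; SRP-PHAT sums such correlations over all pairs and candidate locations. A minimal two-microphone sketch (sample rate, delay, and search bound are illustrative):

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """TDOA estimate via GCC-PHAT; positive result means x2 lags x1.

    The phase transform whitens the cross-spectrum so the correlation
    peak depends only on the relative delay, not on the source spectrum.
    """
    n = len(x1) + len(x2)                      # zero-pad for linear correlation
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12                     # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return -shift / fs

# toy check: x2 is x1 delayed by 40 samples
rng = np.random.default_rng(1)
fs = 16000
s = rng.normal(size=4096)
delay = 40
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))
tau = gcc_phat(x1, x2, fs, max_tau=0.01)       # expected: about delay / fs
```

    Mapping the per-pair delays (or, equivalently, scanning candidate positions and summing the correlations) yields the SRP-PHAT power map whose maxima locate the speakers.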
